Creating a model that should be bad at predicting (but isn’t)¶
Imagine you’re a school principal who would like to find out which incoming students will attain some level of college education later in life. You do this so you don’t waste any time and resources on people you believe are ultimately undeserving. To do this you gather all the information you can about your new students, such as Borough (location), Disability, languages other than English spoken, Ethnicity, and Sex.
Ultimately, this is a list of attributes that really should not have any influence on the level of education an individual obtains, which is why we hope the ML model trained to predict it performs badly.
Interestingly, we would expect languages other than English spoken to have a negative impact on the level of education attained: the majority of Americans speak only English, so speaking a language other than English is an indication that the person belongs to an ethnic minority. Of course, attributes like Ethnicity and Sex having an influence on education would go directly against SDGs 4 and 5.
Finally, as we saw in the heat maps, there is a high concentration of Hispanics and of less-than-high-school education in the Bronx, so it is logical to conclude that Borough would have an impact on education.
What is a classification model?¶
Predicting whether an individual will attain at least some college education (0 = less than high school or high school, 1 = some college or a bachelor’s degree or higher) is a classification task, and our classification model consists of decision trees.
A decision tree works by asking “yes/no” questions, for instance: “Is this person male?”. This creates a split in the tree. Based on the answer, a new question is asked on each branch, creating new splits. Multiple splits are created in this way, so that with each split we gain more information about our data. We select the questions such that the two resulting subgroups are as different from each other as possible, while the data points within each subgroup are as similar as possible. To quantify this we can use a measure called entropy. In the end, the data is split into multiple subgroups, and within each subgroup there is a higher probability of predicting the right target class than if we had not asked any questions.
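As a minimal sketch of what entropy measures, consider a binary label array: a pure subgroup (everyone has the same label) has entropy 0, while a 50/50 split is maximally uncertain with entropy 1 bit. The function below is an illustration, not the exact code used in our pipeline:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array: 0 for a pure subgroup, 1 bit for a 50/50 binary split."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = np.array([1, 1, 1, 1])    # everyone attains higher education: nothing left to learn
mixed = np.array([0, 0, 1, 1])   # 50/50 split: maximally uncertain
print(entropy(pure), entropy(mixed))
```

A good split is one where the entropy of the resulting subgroups is much lower than that of the parent group.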
We then build many of these decision trees, all different and uncorrelated. To make a prediction we combine the trees and output whatever the majority of them predict. This is what is called a random forest. The random forest minimizes errors in the classification because we get inputs from multiple decision trees, so one wrong tree prediction will not make a difference as long as most trees predict correctly.
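The majority vote can be sketched in a few lines of numpy. Here the tree predictions are hypothetical (five trees, four people), purely to illustrate the voting step:

```python
import numpy as np

# hypothetical 0/1 predictions from five decision trees for four people
tree_preds = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
])

# the forest predicts whatever the majority of trees predict:
# a column mean >= 0.5 means at least half the trees voted 1
forest_pred = (tree_preds.mean(axis=0) >= 0.5).astype(int)
print(forest_pred)  # [1 0 1 1]
```

Notice that the second tree is wrong about person 1 (it votes 0 where the others vote 1), yet the forest's vote is unaffected.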
Finally, we also perform a randomized search to select the best random forest model. The randomized search simply creates multiple random forest models with different parameters, such as the number of trees in each forest, the maximum number of levels in each tree, etc. It then trains and evaluates all of these models and selects the random forest that performs best.
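This is what scikit-learn's `RandomizedSearchCV` does. A minimal sketch on synthetic stand-in data follows; the parameter values in the grid are illustrative, not the ones we actually searched:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))   # stand-in for binary demographic features
y = rng.integers(0, 2, size=200)        # stand-in for the education label

param_distributions = {
    "n_estimators": [50, 100, 200],     # number of trees in each forest
    "max_depth": [3, 5, 10, None],      # maximum number of levels in each tree
}

# try n_iter random parameter combinations, score each with 3-fold cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions, n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` is then the random forest we carry forward.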
The depressingly good performance of our model¶
The binary classification model we have created “unfortunately” performs rather well. Unfortunately, this indicates that external factors such as race, disability, location, etc. have an impact on achieving higher education in New York City. It means that, based on features that should be neutral, we can predict whether or not a person has a higher education (or will get one). These are features that we believe should not have any influence on whether a person receives higher education. We can compare our model to a baseline model, which predicts everyone to have higher education, whereas our model makes a prediction for each person based on the previously mentioned features.
We get the following statistical performance measurements for the baseline and random forest model respectively:
Baseline model

| Performance Measurement | Performance |
|---|---|
| Accuracy | 0.59 |
| Precision | 0.59 |
| Recall | 1.00 |
| F1 score | 0.74 |
Random Forest Classifier

| Performance Measurement | Performance |
|---|---|
| Accuracy | 0.65 |
| Precision | 0.67 |
| Recall | 0.79 |
| F1 score | 0.73 |
These measures all lie in the range from zero to one, where zero is the worst and one is the best. We see that the random forest model has the best accuracy and precision, whereas the baseline model has a better recall and F1 score. Accuracy is the number of correct predictions divided by the total number of predictions, so overall our random forest model predicts more correctly than the baseline model. Precision and recall are two measures often seen together. One way to explain them is that precision is a measure of quality and recall is a measure of quantity [precision, recall wiki]: precision tells you how many of the higher-education predictions are correct, and recall tells you how many of the actual higher-education instances the model correctly finds (the true positives, TP). So naturally, when predicting every instance to be higher education, the recall is 1.
Finally, the F1 score is the harmonic mean of the two measurements.
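All four measures can be computed directly from the four confusion counts (TP, FP, FN, TN). The sketch below uses illustrative counts for 100 people with a 59% positive class share, and shows why a predict-everyone-positive baseline always has recall 1:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from the four confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)           # how many positive predictions were right
    recall = tp / (tp + fn)              # how many actual positives we caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# A baseline that predicts "higher education" for everyone has fn = tn = 0,
# so recall is 1 and accuracy/precision both equal the positive class share.
print(classification_metrics(tp=59, fp=41, fn=0, tn=0))
```

With these counts the numbers come out as accuracy 0.59, precision 0.59, recall 1.00, and F1 of about 0.74, matching the baseline table above.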
Hence, to evaluate the model we need to look at what it would be used for. If we entertain the rather uncomfortable thought that the NYC government would predict who gets an education in order to know which people to spend resources on, then their goal is probably to invest only in the people who will get a higher education, and not risk investing in someone who will not. Here they would prefer a higher precision over a higher recall, because high precision indicates that they seldom invest in the wrong people; they would rather invest a little less, even though they will miss some potentially good investments. In this case our model would actually be useful.
The above-stated thought is morally uncomfortable, but it is nonetheless the sad truth that, based on the data from the NYC situation in 2015, we can see the segregation through the model and could use people’s neutral features to predict education.
Speaking of segregation, we can look into our model to see whether it is biased, and in which areas it is most biased. In our case, we know that the model is biased, since we only included features that should have no influence. But we can take a look at which features have the most influence, and hence where the model is most biased. Therefore we have plotted the importance of the different features, which is simply a measure of how important each feature is for the prediction.
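In scikit-learn, which we assume here, a fitted random forest exposes this measure as `feature_importances_`. A minimal sketch on synthetic data, with hypothetical feature names, where the label is constructed to depend mostly on the first feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Non-Hispanic White", "Manhattan", "Hispanic, any Race", "Female", "Disability"]
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, len(feature_names)))
# synthetic label that depends mostly on the first feature, so its importance dominates
y = (X[:, 0] + rng.integers(0, 2, size=300) >= 1).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(feature_names, forest.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```

The importances sum to 1, so each value can be read as the feature's share of the model's total predictive attention.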
We see that Non-Hispanic White is the most important feature and Hispanic, any Race the third most important (with Manhattan being the second most important). Hence the model treats a person’s race as an important attribute for predicting the education level attained. This does not seem fair.
Luckily, sex does not seem that important, which is exactly what we saw in the bar plot of gender in each education level.
Next, we have plotted the normalized confusion matrix of the model’s overall performance. It shows the ratio between the actual educational status and the predicted educational status; these numbers were used to calculate the statistics with which we evaluated the model above. From the matrix we see that our model is better at predicting the people who do get an education than the people who don’t. Hence the model leans towards predicting that people will get a higher education.
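A row-normalized confusion matrix can be obtained with scikit-learn's `confusion_matrix` and its `normalize="true"` option. A small sketch with made-up labels (1 = at least some college):

```python
from sklearn.metrics import confusion_matrix

# hypothetical true labels and predictions for ten people
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# normalize="true" divides each row by the number of actual cases in that class,
# so row i shows the fraction of class-i people assigned to each predicted class
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
```

In this toy example the bottom-right cell (actual 1, predicted 1) is much larger than the top-left cell (actual 0, predicted 0), the same lean towards predicting higher education that our real matrix shows.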
Finally, we go back to check whether our concern about the model being heavily biased with respect to race holds. To investigate this we have plotted the difference in the performance of the model for three ethnicities: Non-Hispanic White (white), Non-Hispanic Black (black), and Hispanic, Any Race (Hispanic).
On the y-axis we have the different labels: TN (true negative: correctly predicted no higher education), FP (false positive: predicted higher education, but had no higher education), FN (false negative: predicted no higher education, but had higher education), and TP (true positive: correctly predicted higher education). The x-axis shows the difference between the chosen race and the overall model, which is not split by race.
From the plot, we see that for white people the model has far fewer true negatives and false negatives, and many more false positives and true positives. This means the model very seldom predicts a white person not to have an education and most frequently predicts white people to have a higher education. For Hispanic and black people we see the opposite trend: the model tends to predict that black and Hispanic people don’t have an education. Hence our concern about the bias in the model was justified. This bias is simply due to the data being biased, indicating racial segregation in the city of New York. This is an unfortunate fact we cannot change.
Making the model fair¶
If we want to use the model to predict education without being unfair to some races, we can debias the model with respect to race. We have chosen white and Hispanic, since these are the most influential, and included black as well, since there is a lot of history regarding black people’s educational rights (and rights in general). The idea behind debiasing is to make the model equally fair for all races. The way we wish to make the model fair is by obtaining a similar true positive rate (TPR = TP/(TP+FN)) and false positive rate (FPR = FP/(FP+TN)) for the races. Ideally, we want a high true positive rate and a low false positive rate.
For our current overall model, we have a TPR (true positive rate) of 0.79 and an FPR (false positive rate) of 0.54. But for white people both of these measurements are higher, and for Hispanics both are much lower. We cannot change the model, but we can change how it predicts. The model gives each person a probability of having a higher education. Normally, if the probability is above 0.5 (50%) it predicts the person to have a higher education; if it is below, it predicts the person not to have a higher education. This 0.5 cutoff is called the threshold. To debias the model we can change the threshold depending on which race we are looking at. We will find the thresholds that give us the best and most similar TPRs and FPRs for whites and Hispanics respectively.
To do this we’ve calculated TPRs and FPRs for both races at a range of different thresholds. We then plot the FPR on the x-axis and the TPR on the y-axis, where each point corresponds to a threshold. This is called a ROC curve.
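A sketch of the underlying computation, with made-up probabilities: for each candidate threshold we count the four confusion outcomes and convert them to a TPR and an FPR; the resulting (FPR, TPR) points trace out the ROC curve.

```python
import numpy as np

def rates(y_true, proba, threshold):
    """TPR and FPR when predicting 'higher education' at or above the given threshold."""
    pred = (proba >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    tn = np.sum((pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# hypothetical model probabilities for ten people
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
proba = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.55, 0.45, 0.3, 0.2, 0.1])
for t in [0.3, 0.5, 0.7]:
    tpr, fpr = rates(y_true, proba, t)
    print(f"threshold {t}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Lowering the threshold trades a higher TPR for a higher FPR, which is exactly the trade-off we exploit per race below.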
Evaluating the ROC curve visually, we look for a high TPR, a low FPR, and three thresholds (points) that lie close to each other, making the rates for the races similar. We have selected the following three thresholds:
Black: 0.53
White: 0.68
Hispanic: 0.42
This means that for black people the probability needs to be above 0.53 to predict higher education, whereas for white people it needs to be above 0.68, and for Hispanics it only needs to be above 0.42.
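Applying the group-specific thresholds is then just a matter of comparing the model's probability against the threshold for the person's group. A minimal sketch (the function name and dictionary are our own; the threshold values are the ones chosen above):

```python
# race-specific thresholds chosen from the ROC curves
thresholds = {"Black": 0.53, "White": 0.68, "Hispanic": 0.42}

def debiased_predict(proba, race):
    """Predict higher education (1) only when the model's probability clears the race-specific threshold."""
    return int(proba > thresholds[race])

# the same model probability of 0.6 leads to different predictions per group
print(debiased_predict(0.6, "White"))     # 0: below the 0.68 threshold
print(debiased_predict(0.6, "Hispanic"))  # 1: above the 0.42 threshold
```

Note that the underlying model is untouched; only the decision rule applied to its probabilities changes.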
The specific threshold values that debias our model reveal how segregated the population is: we can see how much we need to shift the thresholds relative to each other in order to achieve a fair model. Specifically, the threshold for Hispanics is almost two-thirds of that for whites.
We can now take a look at the effect of our debiasing. To do this we have plotted the TPR and FPR for each race, before and after the debiasing, to see how we have minimized the difference between the races.
Hence, after debiasing, our chosen thresholds result in the following rates:
|  | TPR | FPR |
|---|---|---|
| Black | 0.59 | 0.45 |
| White | 0.65 | 0.34 |
| Hispanic | 0.66 | 0.49 |
So now we have a reasonably fair model with regard to the ethnicities (the rates are nearly equal for the three races), and it performs rather okay: the TPR is clearly higher than the FPR, so the model predicts correctly more often than it predicts incorrectly.
The path to a better future¶
Hopefully, the reader is convinced of the importance of education. Not only is it the standpoint of the UN that all countries should strive to ensure a high level of education in their populations, as it is the best tool to achieve sustainable development, but we also saw a high positive correlation between income and education, which helps against poverty.
We saw that females on average have a higher education than males; however, this did not result in females having a higher average salary. In fact, quite the opposite: males had an approximately 50% higher salary. This worrying discrepancy is something we saw throughout the data. Hispanics were much less educated than whites, with blacks somewhere in the middle between the two.
This difference between races had a big influence on the machine learning model we created. Being white meant that the model much more often predicted an individual would attain some level of college, whereas the opposite was true for Hispanics, with black again somewhere in the middle. Interestingly, we saw that the borough of Manhattan also had a positive influence on predicting education. Thus we have two external factors (race and borough) that heavily influence the level of education an individual is predicted to attain. This is obviously a problem: you would prefer these factors to have no influence at all, and certainly not to the extent they currently have.
If you still want to use the model, we showed how to make its predictions equally fair across the ethnicities. We did this by changing how certain the model must be before predicting higher education. Effectively, this means the model treats white people worse than it otherwise would, and Hispanics better. It does this, however, to make up for the racial inequalities that exist in New York City.